Information Theoretic Kernel Integration

Authors

  • Yiming Ying
  • Kaizhu Huang
  • Colin Campbell
Abstract

In this paper we consider a novel information-theoretic approach to multiple kernel learning based on minimising a Kullback-Leibler (KL) divergence between the output kernel matrix and the input kernel matrix. There are two formulations, which we refer to as MKLdiv-dc and MKLdiv-conv. We propose to solve MKLdiv-dc by a difference of convex (DC) programming method and MKLdiv-conv by a projected gradient descent algorithm. The effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold recognition and on a yeast protein function prediction problem.

1 Information-theoretic Data Integration

In this paper we consider the problem of integrating multiple data sources using a kernel-based approach. Recent trends in learning kernel combinations are usually based on the margin maximization criterion used by Support Vector Machines (SVMs) or variants [5, 8, 9, 10, 14, 16, 17]. There, each data source can be represented by x^ℓ = {x_i^ℓ : i ∈ N_n} for ℓ ∈ N_m, and the outputs are similarly denoted by y = {y_i : i ∈ N_n}. With kernel methods, for any ℓ ∈ N_m, the ℓ-th data source can be encoded into a candidate kernel matrix denoted by K_ℓ = (K_ℓ(x_i^ℓ, x_j^ℓ))_{ij}. Depending on the type of data source used, the candidate kernel function K_ℓ would be specified a priori, for example as a graph kernel for graph data or a string kernel for sequence data. The composite kernel matrix is given by K_λ = ∑_{ℓ∈N_m} λ_ℓ K_ℓ. Hence, in this context the problem of data integration is reduced to learning a convex combination of candidate kernel matrices with kernel coefficients, or weights, denoted by λ.

We can quantify the similarity between K_λ and the output kernel K_y through a Kullback-Leibler (KL) divergence or relative entropy term [3, 6, 7, 12, 13]. There is a simple bijection between the set of distance measures in these data spaces and the set of zero-mean multivariate Gaussian distributions [3]. Using this bijection, the difference between two distance measures, parameterized by K_λ and K_y, can be quantified by the relative entropy or KL divergence between the corresponding multivariate Gaussians. Kernel matrices are generally positive semidefinite and thus can be regarded as the covariance matrices of Gaussian distributions. Matching the kernel matrices K_λ and K_y can therefore be realized by minimizing a KL divergence between these two distributions. As described in [3, 6, 13], the KL divergence (relative entropy) between a Gaussian distribution N(0, K_y) with the output covariance matrix K_y and a Gaussian distribution N(0, K_x) with the input kernel covariance matrix K_x is defined by

KL(N(0, K_y) || N(0, K_x)) := (1/2) Tr(K_y K_x^{-1}) + (1/2) log|K_x| - (1/2) log|K_y| - n/2.   (1)

Here Tr(B) denotes the trace of B. Though KL(N(0, K_y) || N(0, K_x)) is non-convex w.r.t. K_x, it has a unique minimum at K_x = K_y if K_y is positive definite, suggesting that minimizing the above KL divergence encourages K_x to approach K_y. If the input kernel matrix K_x is represented by a linear combination of m candidate kernel matrices, i.e. K_x = K_λ = ∑_{ℓ∈N_m} λ_ℓ K_ℓ, the above ...
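To make the objective concrete, the following is a minimal numerical sketch (not the authors' implementation) of evaluating the KL divergence of Eq. (1) for a composite kernel K_λ = ∑_ℓ λ_ℓ K_ℓ. The ridge term eps, the toy data, and the function names are illustrative assumptions; the candidate kernel matrices are assumed to be positive semidefinite.

```python
import numpy as np

def kl_gaussian(K_y, K_x, eps=1e-8):
    """KL( N(0, K_y) || N(0, K_x) ) as in Eq. (1).

    A small ridge eps*I is added so that the inverse and log-determinants are
    well defined for positive semidefinite kernel matrices (an assumption of
    this sketch, not part of the paper's formulation).
    """
    n = K_y.shape[0]
    I = np.eye(n)
    K_y = K_y + eps * I
    K_x = K_x + eps * I
    _, logdet_x = np.linalg.slogdet(K_x)
    _, logdet_y = np.linalg.slogdet(K_y)
    # Tr(K_y K_x^{-1}) computed via a linear solve for numerical stability.
    trace_term = np.trace(np.linalg.solve(K_x, K_y))
    return 0.5 * (trace_term + logdet_x - logdet_y - n)

def composite_kernel(kernels, lam):
    """K_lambda = sum_l lambda_l * K_l for weights lam on the simplex."""
    return sum(w * K for w, K in zip(lam, kernels))

# Toy usage: three random candidate kernels and a target output kernel.
rng = np.random.default_rng(0)
n, m = 30, 3
kernels = [A @ A.T for A in (rng.standard_normal((n, 5)) for _ in range(m))]
B = rng.standard_normal((n, 2))
K_y = B @ B.T
lam = np.ones(m) / m  # uniform kernel weights
print(kl_gaussian(K_y, composite_kernel(kernels, lam)))
```

MKLdiv-conv would then optimize λ over the probability simplex on this kind of objective, e.g. by projected gradient descent, while MKLdiv-dc uses a difference-of-convex decomposition; neither solver is reproduced in this sketch.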

Related Articles

Some Equivalences between Kernel Methods and Information Theoretic Methods

In this paper, we discuss some equivalences between two recently introduced statistical learning schemes, namely Mercer kernel methods and information theoretic methods. We show that Parzen window-based estimators for some information theoretic cost functions are also cost functions in a corresponding Mercer kernel space. The Mercer kernel is directly related to the Parzen window. Furthermore, ...
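As a hedged illustration of one such equivalence (a standard identity in information-theoretic learning, sketched here under a Gaussian Parzen-window assumption rather than taken from the cited paper): the Parzen plug-in estimate of the information potential ∫ p(x)² dx, whose negative logarithm is Rényi's quadratic entropy, reduces to the mean of a Gaussian Mercer-kernel (Gram) matrix with bandwidth √2·σ. Function names and the toy data are illustrative.

```python
import numpy as np

def information_potential(X, sigma):
    """Parzen plug-in estimate of the information potential V = ∫ p(x)^2 dx.

    For a Gaussian Parzen window of width sigma, convolving two windows gives
    a Gaussian of variance 2*sigma^2, so V reduces to the mean of a Gaussian
    Gram matrix -- i.e. a Mercer-kernel quantity.
    """
    n, d = X.shape
    s2 = 2.0 * sigma ** 2  # variance of the convolved window
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    gram = np.exp(-sq_dists / (2.0 * s2)) / ((2.0 * np.pi * s2) ** (d / 2))
    return gram.mean()

def renyi_quadratic_entropy(X, sigma):
    """Rényi quadratic entropy estimate H_2 = -log V."""
    return -np.log(information_potential(X, sigma))

X = np.random.default_rng(1).standard_normal((200, 2))
print(renyi_quadratic_entropy(X, sigma=0.5))
```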


Kernel Principal Components Are Maximum Entropy Projections

Principal Component Analysis (PCA) is a very well-known statistical tool. Kernel PCA is a nonlinear extension to PCA based on the kernel paradigm. In this paper we characterize the projections found by Kernel PCA from an information-theoretic perspective. We prove that Kernel PCA provides optimum entropy projections in the input space when the Gaussian kernel is used for the mapping and a sample...
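For reference, a minimal kernel PCA sketch with the Gaussian kernel is shown below (a generic implementation; it does not reproduce the entropy analysis of the cited paper, and the bandwidth, toy data, and helper names are assumptions).

```python
import numpy as np

def kernel_pca(X, n_components=2, sigma=1.0):
    """Project data onto the leading eigenvectors of the centred Gaussian Gram matrix."""
    n = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))   # Gaussian (RBF) Gram matrix
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    Kc = H @ K @ H                               # centre the kernel in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)        # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas                           # projections of the training points

Z = kernel_pca(np.random.default_rng(2).standard_normal((100, 3)))
print(Z.shape)  # (100, 2)
```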


Spectral Modes of Facial Needle-Maps

Abstract. This paper presents a method to decompose a field of surface normals (needle-map). A diffusion process is used to model the flow of height information induced by a field of surface normals. The diffusion kernel can be decomposed into eigenmodes, each corresponding to approximately independent modes of variation of the flow. The surface normals can then be diffused using a modified ker...


Information Theoretic Clustering using Kernel Density Estimation

In recent years, information-theoretic clustering algorithms have been proposed which assign data points to clusters so as to maximize the mutual information between cluster labels and data [1, 2]. Using mutual information for clustering has several attractive properties: it is flexible enough to fit complex patterns in the data, and allows for a principled approach to clustering without assumi...


Blind Signal Processing based on Information Theoretic Learning with Kernel-size Modification for Impulsive Noise Channel Equalization

This paper presents a new performance-enhancement method for information-theoretic learning (ITL) based blind equalizer algorithms in ISI communication channel environments with a mixture of AWGN and impulsive noise. The Gaussian kernel of the Euclidean distance (ED) minimizing blind algorithm using a set of evenly generated symbols has the net effect of reducing the contribution of samples that ar...



Journal:

Volume:   Issue:

Pages:   -

Publication date: 2009